
    Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

    Neural networks (NNs) are growing in importance and complexity. A neural network's performance (and energy efficiency) can be bound either by computation or memory resources. The processing-in-memory (PIM) paradigm, where computation is placed near or within memory arrays, is a viable solution to accelerate memory-bound NNs. However, PIM architectures vary in form, where different PIM approaches lead to different trade-offs. Our goal is to analyze, discuss, and contrast DRAM-based PIM architectures for NN performance and energy efficiency. To do so, we analyze three state-of-the-art PIM architectures: (1) UPMEM, which integrates processors and DRAM arrays into a single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute bit-serial operations. Our analysis reveals that PIM greatly benefits memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when the GPU requires memory oversubscription for a general matrix-vector multiplication kernel; (2) Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3) SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude that the ideal PIM architecture for NN models depends on a model's distinct attributes, due to the inherent architectural design choices.
    Comment: This is an extended and updated version of a paper published in IEEE Micro, pp. 1-14, 29 Aug. 2022. arXiv admin note: text overlap with arXiv:2109.1432
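    To make the memory-bound nature of the matrix-vector workload above concrete, the following back-of-the-envelope sketch computes the arithmetic intensity of a general matrix-vector multiplication (GEMV). The matrix dimensions and data type are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope arithmetic intensity for a GEMV kernel (y = A @ x),
# illustrating why such kernels are typically memory-bound and thus good
# candidates for processing-in-memory. Matrix size and element width below
# are illustrative assumptions.

def gemv_arithmetic_intensity(rows: int, cols: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * rows * cols                       # one multiply + one add per element of A
    bytes_moved = bytes_per_elem * (rows * cols   # read A once
                                    + cols        # read x
                                    + rows)       # write y
    return flops / bytes_moved

if __name__ == "__main__":
    ai = gemv_arithmetic_intensity(rows=4096, cols=4096)
    # Roughly 0.5 FLOP per byte: far below what a modern GPU needs to keep its
    # compute units busy, so the kernel's speed is set by memory bandwidth.
    print(f"GEMV arithmetic intensity: {ai:.2f} FLOP/byte")
```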

    Extending Memory Capacity in Consumer Devices with Emerging Non-Volatile Memory: An Experimental Study

    The number and diversity of consumer devices are growing rapidly, alongside their target applications' memory consumption. Unfortunately, DRAM scalability is becoming a limiting factor to the available memory capacity in consumer devices. As a potential solution, manufacturers have introduced emerging non-volatile memories (NVMs) into the market, which can be used to increase the memory capacity of consumer devices by augmenting or replacing DRAM. Since entirely replacing DRAM with NVM in consumer devices imposes large system integration and design challenges, recent works propose extending the total main memory space available to applications by using NVM as swap space for DRAM. However, no prior work analyzes the implications of enabling a real NVM-based swap space in real consumer devices. In this work, we provide the first analysis of the impact of extending the main memory space of consumer devices using off-the-shelf NVMs. We extensively examine system performance and energy consumption when the NVM device is used as swap space for DRAM main memory to effectively extend the main memory capacity. For our analyses, we equip real web-based Chromebook computers with the Intel Optane SSD, which is a state-of-the-art low-latency NVM-based SSD device. We compare the performance and energy consumption of interactive workloads running on our Chromebook with NVM-based swap space, where the Intel Optane SSD capacity is used as swap space to extend main memory capacity, against two state-of-the-art systems: (i) a baseline system with double the amount of DRAM of the system with the NVM-based swap space; and (ii) a system where the Intel Optane SSD is naively replaced with a state-of-the-art (yet slower) off-the-shelf NAND-flash-based SSD, which we use as a swap space of equivalent size to the NVM-based swap space.

    Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Cooperation

    A growth in data volume, combined with increasing demand for real-time analysis (using the most recent data), has resulted in the emergence of database systems that concurrently support transactions and data analytics. These hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant losses in transactional (up to 74.6%) and/or analytical (up to 49.8%) throughput compared to performing only transactional or only analytical queries in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation from transactional to analytical workloads, and (3) the cost to maintain a consistent view of data across the system. We propose Polynesia, a hardware-software co-designed system for in-memory HTAP databases that avoids the large throughput losses of traditional HTAP systems. Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements new custom hardware that unlocks software optimizations to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement overheads. Our evaluation shows that Polynesia outperforms three state-of-the-art HTAP systems, with average transactional/analytical throughput improvements of 1.7x/3.7x, and reduces energy consumption by 48% over the prior lowest-energy HTAP system.
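    As a rough illustration of the update-propagation and consistency costs discussed above, the sketch below models the "islands" idea in plain Python: a transactional island applies row-oriented writes and logs them, and an analytical island drains that log into a column-oriented copy before answering queries. All class and method names are hypothetical; this is a minimal software analogue of the idea, not Polynesia's hardware design.

```python
from collections import deque

# Hypothetical software analogue of an HTAP split into a transactional island
# (row store) and an analytical island (column store) connected by an update log.

class TransactionalIsland:
    def __init__(self, update_log: deque):
        self.rows = {}                 # row-oriented store: key -> dict of columns
        self.update_log = update_log

    def put(self, key, row: dict):
        self.rows[key] = row
        self.update_log.append((key, row))   # ship the update for later propagation

class AnalyticalIsland:
    def __init__(self, update_log: deque):
        self.columns = {}              # column-oriented store: column -> {key: value}
        self.update_log = update_log

    def propagate(self):
        # Drain pending updates before answering an analytical query, so the
        # analytical view stays consistent with committed transactional writes.
        while self.update_log:
            key, row = self.update_log.popleft()
            for col, val in row.items():
                self.columns.setdefault(col, {})[key] = val

    def column_sum(self, col):
        self.propagate()
        return sum(self.columns.get(col, {}).values())

if __name__ == "__main__":
    log = deque()
    txn, olap = TransactionalIsland(log), AnalyticalIsland(log)
    txn.put(1, {"price": 10})
    txn.put(2, {"price": 32})
    print(olap.column_sum("price"))    # 42
```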

    Extending Memory Capacity in Modern Consumer Systems With Emerging Non-Volatile Memory: Experimental Analysis and Characterization Using the Intel Optane SSD

    DRAM scalability is becoming a limiting factor to the available memory capacity in consumer devices. As a potential solution, manufacturers have introduced emerging non-volatile memories (NVMs) into the market, which can be used to increase the memory capacity of consumer devices by augmenting or replacing DRAM. In this work, we provide the first analysis of the impact of extending the main memory space of consumer devices using off-the-shelf NVMs. We equip real web-based Chromebook computers with the Intel Optane solid-state drive (SSD), which contains state-of-the-art low-latency NVM, and use the NVM as swap space. We analyze the performance and energy consumption of the Optane-equipped Chromebooks, and compare this with (i) a baseline system with double the amount of DRAM of the system with the NVM-based swap space; and (ii) a system where the Intel Optane SSD is naively replaced with a state-of-the-art NAND-flash-based SSD. Our experimental analysis reveals that while Optane-based swap space provides a cost-effective way to alleviate the DRAM capacity bottleneck in consumer devices, naive integration of the Optane SSD leads to several system-level overheads, mostly related to (1) the Linux block I/O layer, which can negatively impact overall performance; and (2) the off-chip traffic to the swap space, which can negatively impact energy consumption. To reduce the Linux block I/O layer overheads, we tailor several system-level mechanisms (i.e., the I/O scheduler and the I/O request completion mechanism) to the currently-running application's access pattern. To reduce the off-chip traffic overhead, we leverage an operating system feature (called Zswap) that allocates some DRAM space to be used as a compressed in-DRAM cache for data swapped between DRAM and the Intel Optane SSD, significantly reducing energy consumption caused by the off-chip traffic to the swap space. We conclude that emerging NVMs are a cost-effective solution to alleviate the DRAM capacity bottleneck in consumer devices, which can be further enhanced by tailoring system-level mechanisms to better leverage the characteristics of our workloads and the NVM.
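    For readers who want to experiment with the Zswap feature mentioned above, the sketch below toggles it through the Linux sysfs interface. The /sys/module/zswap/parameters files exist on mainline kernels built with zswap, but the chosen compressor and pool size are illustrative assumptions rather than the configuration evaluated in the paper, and writing these files requires root privileges.

```python
from pathlib import Path

# Minimal sketch: enable the zswap compressed in-DRAM cache for swapped pages.
# Parameter values below are illustrative, not the paper's configuration.

ZSWAP = Path("/sys/module/zswap/parameters")

def set_param(name: str, value: str) -> None:
    (ZSWAP / name).write_text(value)    # requires root

if __name__ == "__main__":
    set_param("enabled", "1")             # turn zswap on
    set_param("compressor", "lzo")        # compression algorithm for cached pages
    set_param("max_pool_percent", "20")   # cap the in-DRAM pool at 20% of RAM
    for p in sorted(ZSWAP.iterdir()):     # print the resulting configuration
        print(p.name, "=", p.read_text().strip())
```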

    Fast Bulk Bitwise AND and OR in DRAM

    Bitwise operations are an important component of modern day programming, and are used in a variety of applications such as databases. In this work, we propose a new and simple mechanism to implement bulk bitwise AND and OR operations in DRAM, which is faster and more efficient than existing mechanisms. Our mechanism exploits existing DRAM operation to perform a bitwise AND/OR of two DRAM rows completely within DRAM. The key idea is to simultaneously connect three cells to a bitline before the sense-amplification. By controlling the value of one of the cells, the sense amplifier forces the bitline to the bitwise AND or bitwise OR of the values of the other two cells. Our approach can improve the throughput of bulk bitwise AND/OR operations by 9.7x and reduce their energy consumption by 50.5x. Since our approach exploits existing DRAM operation as much as possible, it requires negligible changes to DRAM logic. We evaluate our approach using a real-world implementation of a bit-vector based index for databases. Our mechanism improves the performance of commonly-used range queries by 30% on average.
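    The three-cell connection described above effectively computes a bitwise majority of the two data rows and a control row: presetting the control row to all 0s yields AND, and to all 1s yields OR. The following is a functional software model of that behavior on plain integers, assuming an illustrative row width; it models the logic only, not the DRAM circuit itself.

```python
# Functional model of triple-row activation: the sense amplifier settles to the
# bitwise MAJORITY of the three connected cells, so a control row of all 0s
# produces AND and a control row of all 1s produces OR.

WIDTH = 8                        # illustrative row width in bits
MASK = (1 << WIDTH) - 1

def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is 1 iff at least two inputs are 1."""
    return (a & b) | (b & c) | (c & a)

def bulk_and(row_a: int, row_b: int) -> int:
    return majority(row_a, row_b, 0)        # control row preset to all 0s

def bulk_or(row_a: int, row_b: int) -> int:
    return majority(row_a, row_b, MASK)     # control row preset to all 1s

if __name__ == "__main__":
    a, b = 0b10110010, 0b11010110
    assert bulk_and(a, b) == a & b
    assert bulk_or(a, b) == a | b
    print(f"AND: {bulk_and(a, b):08b}  OR: {bulk_or(a, b):08b}")
```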